Lexicon Acquisition: Learning from Corpus by Capitalizing on Lexical Categories

Author

  • Uri Zernik
Abstract

Text examples must be exploited in the acquisition of lexical structures. However, neither syntactic nor semantic features are provided by the text itself, and so acquisition must be aided by additional resources. We investigate the application of an existing resource, a set of lexical categories, as a prediction method. We present an algorithm that applies (a) top-down prediction based on lexical categories; (b) bottom-up validation by scanning text examples. Finally, we discuss the issue of semantic bootstrapping and identify its theoretical and practical limitations.

1 Introduction

Existing programs frequently stumble when encountering a new word. Such lexical gaps diminish the utility of natural language technology. The problem is aggravated by the existence of entire unknown phrases composed of individually well-known words:

(1) John made a table from raw wood.
(2) He made a widow happy.
(3) He made a happy widow.
(4) He made the widow a table.
(5) He made her leave early.

Each make phrase interacts with its arguments in its own idiosyncratic way. Example (1) presents the simple usage of make: make means generate. Examples (2) and (3), both taken from Love in the Time of Cholera [Marquez, 1986], are more intriguing since similar words combine in entirely different ways: in (2) she (the widow) becomes happy; in (3) he becomes a happy widow. Example (4) introduces the beneficiary interaction: he made it for her. And finally, example (5) brings in the complement-taking form: he forced her to act. These examples illustrate how subtle differences in argument structure can change the entire meaning. A lexicon must therefore account for all the variations such a verb can possibly assume.

Idiomatic phrases are not confined to fine literature. The sample sentences on the next page are taken from the Dow-Jones newswire (July 7, 1988). A brief observation reveals the diversity of phrases used in this technical domain.
"Make a statement", "make a plan", and "make a decision" fall into one category. "Make final net $10,000" falls into a second category. "Make it difficult", "make it attractive", and "make it available" fall into yet another category. This small collection of sentences illustrates (a) how extensive a lexicon should be to facilitate effective text processing, and (b) the wealth of raw information provided in the text for lexical acquisition purposes. A program processing such text cannot be provided with all these categories at the outset. Therefore, lexical knowledge must be acquired on demand: once …
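The abstract describes an algorithm that pairs top-down prediction from lexical categories with bottom-up validation against text examples. The sketch below is only a rough illustration of that two-step idea, not the paper's algorithm: the category names, the regular-expression templates, the word lists inside them, the function names, and all sample sentences except example (5) are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical category templates (assumed for illustration, not from the paper).
# "{verb}" is replaced by the surface forms of the word being acquired.
CATEGORY_TEMPLATES = {
    "verb + a + act noun": r"\b{verb}\s+a\s+(?:statement|plan|decision)\b",
    "verb + it + adjective": r"\b{verb}\s+it\s+(?:difficult|attractive|available)\b",
    "verb + pronoun + bare verb": r"\b{verb}\s+(?:her|him|them)\s+(?:leave|go|stay)\b",
}


def predict_categories(verb_forms):
    """Top-down step: instantiate every known category template for the verb."""
    alternation = "(?:" + "|".join(re.escape(form) for form in verb_forms) + ")"
    return {
        name: re.compile(template.format(verb=alternation), re.IGNORECASE)
        for name, template in CATEGORY_TEMPLATES.items()
    }


def validate_predictions(predictions, sentences, min_support=1):
    """Bottom-up step: keep only predictions attested in the text examples."""
    support = Counter()
    for sentence in sentences:
        for name, pattern in predictions.items():
            if pattern.search(sentence):
                support[name] += 1
    return {name: count for name, count in support.items() if count >= min_support}


if __name__ == "__main__":
    # Illustrative sentences; only the first is taken from the text above.
    corpus = [
        "He made her leave early.",
        "The board will make a decision.",
        "High rates make it difficult to borrow.",
        "They make a plan.",
    ]
    predictions = predict_categories(["make", "made"])
    print(validate_predictions(predictions, corpus))
```

Raising `min_support` makes validation stricter: a predicted category survives only if enough sentences attest it, which is one simple way to trade recall for precision when the corpus is noisy.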

Similar resources

The production of lexical categories (VP) and functional categories (copula) at the initial stage of child L2 acquisition

This is a longitudinal case study of two Farsi-speaking children learning English: ‘Bernard’ and ‘Melissa’, who were 7;4 and 8;4 at the start of data collection. The research deals with the initial state and further development in the child second language (L2) acquisition of syntax regarding the presence or absence of copula as a functional category, as well as the role and degree of L1 influe...

Submitted to International Conference of Machine Learning, 1996. Lexical Acquisition: A Novel Machine Learning

This paper defines a new machine learning problem to which standard machine learning algorithms cannot easily be applied. The problem occurs in the domain of lexical acquisition. The ambiguous and synonymous nature of words causes the difficulty of using standard induction techniques to learn a lexicon. Additionally, negative examples are typically unavailable or difficult to construct in this doma...

Unsupervised Lexical Learning With Categorial Grammars

In this paper we report on an unsupervised approach to learning Categorial Grammar (CG) lexicons. The learner is provided with a set of possible lexical CG categories, the forward and backward application rules of CG and unmarked positive only corpora. Using the categories and rules, the sentences from the corpus are probabilistically parsed. The parses and the history of previously parsed sent...

Deep Lexical Acquisition of Type Properties in Low-resource Languages: A Case Study in Wambaya

We present a case study on applying common methods for the prediction of lexical properties to a low-resource language, namely Wambaya. Leveraging a small corpus leads to a typical high-precision, low-recall system; using the Web as a corpus has no utility for this language, but a machine learning approach seems to utilise the available resources most effectively. This motivates a semi-supervis...

Early lexical development in a self-organizing neural network

In this paper we present a self-organizing neural network model of early lexical development called DevLex. The network consists of two self-organizing maps (a growing semantic map and a growing phonological map) that are connected via associative links trained by Hebbian learning. The model captures a number of important phenomena that occur in early lexical acquisition by children, as it allo...

Publication date: 1989